Getting ‘ϕψχal’ with proteins: minimum message length inference of joint distributions of backbone and sidechain dihedral angles

Abstract   The tendency of an amino acid to adopt certain configurations in folded proteins is treated here as a statistical estimation problem. We model the joint distribution of the observed mainchain and sidechain dihedral angles (〈ϕ,ψ,χ1,χ2,…〉) of any amino acid by a mixture of a product of von Mises probability distributions. This mixture model maps any vector of dihedral angles to a point on a multi-dimensional torus. The continuous space it uses to specify the dihedral angles provides an alternative to the commonly used rotamer libraries. These rotamer libraries discretize the space of dihedral angles into coarse angular bins, and cluster combinations of sidechain dihedral angles (〈χ1,χ2,…〉) as a function of backbone 〈ϕ,ψ〉 conformations. A 'good' model is one that is both concise and explains (compresses) observed data. Competing models can be compared directly, and in particular our model is shown to outperform the Dunbrack rotamer library in terms of model complexity (by three orders of magnitude) and fidelity (on average 20% more compression) when losslessly explaining the observed dihedral angle data across experimental resolutions of structures. Our method is unsupervised (with parameters estimated automatically) and uses information theory to determine the optimal complexity of the statistical model, thus avoiding under/over-fitting, a common pitfall in model selection problems. Our models are computationally inexpensive to sample from and are geared to support a number of downstream studies, including experimental structure refinement, de novo protein design, and protein structure prediction. We call our collection of mixture models PhiSiCal (ϕψχal).

Availability and implementation   PhiSiCal mixture models and programs to sample from them are available for download at http://lcb.infotech.monash.edu.au/phisical.


Introduction
The 20 naturally occurring amino acids form nature's parts list from which proteins are made within the cells of organisms. In all amino acids a central carbon atom (the α-carbon) binds an amino group (-NH2), a carboxylic acid (-COOH) group, and a hydrogen atom; the amino acids differ in the fourth group attached, a sidechain (R).
Protein polypeptide chains of amino acids fold into compact three-dimensional shapes stabilized by inter-atomic interactions between the amino acids. The resultant amino acid conformations are determined by the varying degrees of rotations ('torsions') around the atomic bonds, subject to the physics and chemistry of protein folding.
Any torsion can be mathematically calculated as a 'dihedral angle' (the angle between two planes) defined by four points (here, the coordinates of successively bonded atoms) sharing a common basis vector (here, the central bond around which the torsion is being measured) (IUPAC-IUB Commission, 1970). Thus, any amino acid conformation can be described as a vector of dihedral angles, conventionally denoted by the sequence of symbols 〈ϕ, ψ, ω, χ1, χ2, …〉 (see Fig. 1).
Across all amino acids, the symbols 〈ϕ, ψ, ω〉 are used to denote the dihedral angles around the backbone bonds, whereas 〈χ1, χ2, …〉 are used to denote exclusively the torsions around the sidechain bonds. Note that the number of sidechain dihedral angles depends on the sidechain (R) group, and hence varies with the amino acid type.
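The dihedral angle calculation described above can be sketched as follows. This is a minimal illustration using the standard atan2 formulation on four 3D points; the function name `dihedral_degrees` and the toy coordinates in the usage note are assumptions made for illustration and are not taken from the PhiSiCal software.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the central bond p1-p2.

    The four points are the coordinates of successively bonded atoms.
    The result is 0 for a cis arrangement and +/-180 for trans.
    """
    b0 = _sub(p1, p0)          # first bond vector
    b1 = _sub(p2, p1)          # central bond (the common basis vector)
    b2 = _sub(p3, p2)          # last bond vector
    n1 = _cross(b0, b1)        # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)        # normal of the plane through p1, p2, p3
    b1_len = math.sqrt(_dot(b1, b1))
    y = b1_len * _dot(b0, n2)  # proportional to the sine of the angle
    x = _dot(n1, n2)           # proportional to the cosine of the angle
    return math.degrees(math.atan2(y, x))
```

For example, with the toy points (1,0,0), (0,0,0), (0,1,0), placing the fourth point at (1,1,0) gives a cis arrangement (angle 0), whereas placing it at (-1,1,0) gives a trans arrangement (magnitude 180).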
Analysis of the observed distributions of backbone and sidechain dihedral angles has been an object of intense interest since the early protein structural and biophysical studies: Ramachandran et al. (1963), Janin and Wodak (1978), McGregor et al. (1987), Dunbrack and Karplus (1993), Dunbrack and Cohen (1997), Dunbrack (2002), and Dunbrack (2007, 2011). This interest is fuelled by the need for accurate statistical models that can effectively characterize the observed dihedral angle distributions of proteins, as these models are used by techniques for protein experimental structure determination, computational prediction, rational design, and many other protein structural analyses.
One of the results has been the creation of rotamer libraries. A 'rotamer' is any rotational preference of the set of dihedral angles along the sidechain bonds within amino acids. These libraries are compiled from the statistical clustering of sidechain conformations of known protein structures (Dunbrack 2002). Rotamer libraries are of two kinds: backbone independent and backbone dependent. Backbone-dependent rotamer libraries contain rotameric preferences conditioned on any observed backbone dihedral angles (Dunbrack and Karplus 1993; Dunbrack and Cohen 1997; Shapovalov and Dunbrack 2011), and differ from the backbone-independent libraries, which simply cluster sidechain conformations agnostic to the backbone conformation of amino acids (Ponder and Richards 1987; Lovell et al. 2000).
Rotamer libraries derive sidechain conformation statistics using coarse quantization of the observed rotation space for each sidechain dihedral angle. This discretization often uses an angular interval of 120°, yielding a (−60°, 60°, 180°) trisection of the rotational space that corresponds to the staggered conformations of two sp3-hybridized atoms (Dunbrack 2002). Under such a discretization, each rotamer clusters around a mean conformational preference over a discretized interval. Such rotameric descriptions of sidechain torsions have the advantage of yielding a computationally tractable conformation space when inferring rotational preferences of individual amino acids and fitting them in several protein modelling tasks [e.g. in de novo protein design (Desmet et al. 1992)]. However, such discretizations can also bias downstream studies, e.g. leading to inaccurate modelling of the details of inter-atomic interactions for protein docking (Wang et al. 2005), and to imprecise protein conformational energy landscapes (Grigoryan et al. 2007), among others (Lassila 2010). Further, several of the outermost dihedral angles of certain amino acids (χ3 of glutamic acid (GLU) and glutamine (GLN), and χ2 of aspartic acid (ASP) and asparagine (ASN)) flout the three-way discretization of their rotational space and hence lead to broad and visually featureless distributions that have resisted attempts to characterize the observed spread accurately (Lovell et al. 1999; Shapovalov and Dunbrack 2011). As discussed by Schrauber et al. (1993), in these instances the rotameric representation of sidechain conformations is limited and large deviations of χ angles from the canonical values can be observed. The existence of such 'nonrotameric' conformations was also discussed in detail by Heringa and Argos (1999).
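To make the information loss of a three-way discretization concrete, the sketch below assigns a sidechain angle to the nearest of the three canonical centres (−60°, 60°, 180°) using circular distance. The function names are hypothetical, and nearest-centre assignment is an assumed simplification of how a rotamer library bins its data.

```python
ROTAMER_CENTRES = (-60.0, 60.0, 180.0)  # canonical staggered positions, in degrees

def wrap_degrees(angle):
    """Map any angle in degrees to the interval (-180, 180]."""
    wrapped = (angle + 180.0) % 360.0 - 180.0
    return 180.0 if wrapped == -180.0 else wrapped

def circular_distance(a, b):
    """Smallest absolute angular separation between two angles, in degrees."""
    return abs(wrap_degrees(a - b))

def nearest_rotamer_centre(chi):
    """Return the canonical centre closest to the angle chi (degrees)."""
    return min(ROTAMER_CENTRES, key=lambda c: circular_distance(chi, c))
```

Under this scheme, angles as different as 5° and 115° are both reported as the 60° state, which is the kind of coarse summary that a continuous model avoids.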
An approach employed to mitigate this issue is to calculate distribution frequencies on a finer grid (Schrauber et al. 1993). A more accurate approach is to model the distribution over a continuous space, as this would result in a finer representation minimizing information loss. This is the approach taken by BASILISK (Harder et al. 2010) which formulates a probabilistic model that represents the torsion angles in a continuous space. However, it uses a single probabilistic model for all the amino acids.
The Dunbrack rotamer library (Dunbrack and Karplus 1993; Dunbrack and Cohen 1997; Dunbrack 2002; Dunbrack 2007, 2011) is a continually maintained and improved rotamer library. It defines the state of the art and is among the most widely used rotamer libraries across many downstream applications that employ them. While this library is backbone dependent, it uses the same supervised-discretized choices. This discretization renders the resultant models both overly complex and inaccurate in capturing the observed distributions of dihedral angles when sampled from its libraries (see Section 3).
In this work, we take a different approach by modelling the joint distributions of the observed mainchain and sidechain dihedral angles of individual amino acids by a mixture of a product of von Mises probability distributions. To infer these mixture models, we use the Bayesian and information-theoretic criterion of minimum message length (MML) (Wallace and Boulton 1968; Wallace and Freeman 1987; Wallace 2005). In the theory of learning and generalization, this unsupervised model selection framework falls under the class of statistical inductive inference (Wallace 2005). Among other notable and well-established statistical properties, MML allows an objective trade-off between model complexity and fit. These form two opposing criteria that all model selection problems contend with, and for which MML provides an intuitive, objective, and rigorous reconciliation.
We compared our mixture models inferred for each amino acid with the Dunbrack rotamer library on large datasets containing structures that are non-redundant in sequence and filtered based on high-resolution, B-factor, and R-factor cutoffs. Our results clearly demonstrate that the mixture models we infer outperform the Dunbrack rotamer library both in model complexity (by three orders of magnitude) and in fidelity (yielding on average 20% more lossless compression) when explaining the observed dihedral angle data. Our MML mixture model library, termed 'ϕψχal', supports fast sampling of joint and conditionally distributed dihedral angle vectors for use in many downstream studies involving protein structures.

Mixture model overview
We present a systematic method of 'unsupervised' estimation of a statistical model that can effectively explain any given observations of 'vectors' (of any dimension) of dihedral angles using the statistical inductive inference framework of MML (Wallace and Boulton 1968;Wallace 2005;Allison 2018).
Specifically, this work infers a 'mixture model' under the Bayesian and information-theoretic criterion of MML, where each component of the mixture defines a 'product' of a series of von Mises distributions (Mardia et al. 2000), one for each dihedral angle observed in the specified amino acid. We note that the number of components, their probabilities, and corresponding parameters are all unknown and are inferred unsupervised by our method.
Formally, for a specified amino acid 'aa' (i.e. any of the 20 naturally occurring amino acids in proteins), $X = \{x_1, x_2, \ldots, x_N\}$ represents an input set of $N$ observations of the conformational states of that amino acid. Each $x_i \in X$ defines a vector of the $d$ dihedral angles (whose terms are specified in some canonical order) as observed in the $i$-th instance of 'aa'. For example, each instance of the amino acid methionine (see Fig. 1) is defined by a $d = 6$-dimensional vector containing its dihedral angles $\langle \phi, \psi, \omega, \chi_1, \chi_2, \chi_3 \rangle$. In this case, $X$ captures the set of observed instances of various conformational states of methionine derived from a non-redundant set of experimental coordinates in the world-wide protein data bank (Berman et al. 2000).

[Figure 1, caption fragment (figure not reproduced): The angles shown are those observed for MET67 in the fibroblast growth factor protein, 1BAR. Note that the value of $\chi_1 = -180^\circ$ for the Cα-Cβ bond corresponds to the trans conformation. For MET, the sidechain, or R group, is - (truncated in the source).]

i358 Amarasinghe et al.

A 'mixture model' is any convex combination of 'component' probability density functions used to explain some observed data containing a number of subpopulations (often unknown in advance) within an overall population (Figueiredo and Jain 2002; McLachlan et al. 2019). Specifically, in this work, we consider a mixture model that takes the general form:

$$\Pr(x_i \mid M(\Lambda)) = \sum_{j=1}^{|M|} w_j \, f(x_i; \Theta_j). \qquad (1)$$

This defines a continuous probability distribution for a $d$-dimensional random vector $x_i$ whose terms are angles in $(-\pi, \pi]$. Thus, the support for $x_i$ defines the surface of a $d$-torus (denoted as $\mathbb{T}^d$). $|M| \in \mathbb{Z}^+$ denotes the size of the mixture model given by the number of 'components' it defines. Each component function $f(x_i; \Theta_j)$, abbreviated $f(\Theta_j)$, denotes the joint probability distribution of the random vector $x_i \in \mathbb{T}^d$. In this work, each mixture component takes the form of a product of $d$ von Mises circular distributions, $f(\Theta_j) \propto \prod_{p=1}^{d} \exp(\kappa_{jp} \cos(x_{ip} - \mu_{jp}))$, where each $\langle \mu_{jp}, \kappa_{jp} \rangle$ represents the $\langle$mean, concentration$\rangle$ parameters of one von Mises term in the product and $\Theta_j = \{\langle \mu_{jp}, \kappa_{jp} \rangle\}_{\forall 1 \le p \le d}$ denotes the collection of all von Mises parameters of the $j$-th mixture component. Each $w_j$ denotes a mixture component's 'weight'; over all $|M|$ terms in the mixture, the weights add up to 1.
Finally, we use $\Lambda$ as a shorthand to collectively denote all of the mixture model's parameters: 1) the 'number' of mixture components $|M|$, 2) the set of 'weights' of mixture components $\{w_j\}_{\forall 1 \le j \le |M|}$, and 3) the set of all parameters defining the mixture 'components' $\{\Theta_j\}_{\forall 1 \le j \le |M|} = \{\{\langle \mu_{jp}, \kappa_{jp} \rangle\}_{\forall 1 \le p \le d}\}_{\forall 1 \le j \le |M|}$.
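As an illustration of why a mixture of products of von Mises distributions is inexpensive to sample from, the sketch below first draws a component according to the weights and then draws each dihedral angle independently. It relies on Python's standard-library `random.vonmisesvariate`; the function names and any parameter values passed to them are hypothetical and are not the inferred PhiSiCal parameters.

```python
import math
import random

def wrap_to_pi(angle):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def sample_mixture(weights, components, rng):
    """Draw one d-dimensional vector of dihedral angles (radians).

    weights[j]    : mixture weight w_j (the weights sum to 1)
    components[j] : list of d (mu, kappa) pairs, one per dihedral angle
    rng           : a random.Random instance
    """
    # Choose component j with probability w_j.
    j = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Within a component the angles are independent von Mises variables.
    # vonmisesvariate expects mu in [0, 2*pi) and returns a value in [0, 2*pi).
    return [wrap_to_pi(rng.vonmisesvariate(mu % (2.0 * math.pi), kappa))
            for mu, kappa in components[j]]
```

A call such as `sample_mixture([0.6, 0.4], components, random.Random(0))`, with `components` holding two lists of (mean, concentration) pairs, returns one angle vector per call; repeated calls build up a sample from the joint distribution.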
Thus, for any specified amino acid 'aa' with its given set of dihedral angle tuples X, the goal of this work is to infer a mixture model M that best explains all the observations in X.
The key challenge in doing so is to estimate the mixture parameters $\Lambda$ unsupervised. To address this unsupervised estimation problem, we employ the Bayesian and information-theoretic criterion of MML, as follows.

MML inference foundations
2.2.1 MML and model selection
MML is a Bayesian method for hypothesis/model selection. In general terms, if $X$ is some given data and $M$ is some statistical model describing that data, the joint probability of the model $M$ and data $X$ is given by the product rule of probability: $\Pr(M, X) = \Pr(M)\Pr(X \mid M)$. This can be recast in terms of Shannon information based on the observation that the optimal code length to represent any event $E$ (with a probability $\Pr(E)$) is given by the measure of Shannon information content quantified (say in bits of information) as $I(E) = -\log_2(\Pr(E))$ (Shannon 1948). Expressing the above product rule of probability in terms of Shannon information content, we get:

$$I(M, X) = I(M) + I(X \mid M). \qquad (2)$$

In the above equation, the amount of information required to losslessly explain the observed data $X$ with a hypothesis/model $M$ can be seen as the length of a two-part message: the 'first part' contains the information required to state the model $M$ losslessly (quantifying the model's descriptive 'complexity'), whereas the 'second part' contains the information required to state the data $X$ 'given' the model $M$ (quantifying the model's 'fit' with the data). In this information-theoretic view, the best model $M^*$ is the one whose total two-part message is minimum (optimally trading off the model's complexity and fit): $M^* = \arg\min_{\forall M} I(M, X)$. This is equivalent to maximizing the joint probability, $\arg\max_{\forall M} \Pr(M, X)$. Thus, under the MML framework, any pair of competing models explaining the same data can be compared based on their respective total lengths: the difference in total message lengths derived using any two models gives their log-odds posterior ratio, making this method of model selection Bayesian (Wallace 2005; Allison 2018).
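The two-part comparison above reduces to simple arithmetic, sketched below with hypothetical function names and illustrative message lengths. Because a total message length in bits is the negative base-2 logarithm of a joint probability, a difference of $b$ bits between two models corresponds to posterior odds of $2^b$.

```python
import math

def information_bits(probability):
    """Shannon information content I(E) = -log2(Pr(E)), in bits."""
    return -math.log2(probability)

def two_part_length(model_bits, data_given_model_bits):
    """Total message length I(M, X) = I(M) + I(X | M), in bits."""
    return model_bits + data_given_model_bits

def posterior_odds(total_bits_a, total_bits_b):
    """Posterior odds in favour of model A over model B.

    Since I(M, X) = -log2 Pr(M, X), the ratio Pr(A, X) / Pr(B, X)
    equals 2 ** (I(B, X) - I(A, X)).
    """
    return 2.0 ** (total_bits_b - total_bits_a)
```

For instance, a simple model costing 10 bits that encodes the data in 100 bits (110 in total) is preferred over a more complex model costing 40 bits that encodes the data in 73 bits (113 in total), with posterior odds of 2³ = 8 to 1.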

Wallace-Freeman method of parameter estimation using MML
Let $M(\alpha)$ denote a twice-differentiable statistical model with a parameter vector $\alpha$ (with $|\alpha|$ free parameters) and let $X$ denote some observed data (containing $|X|$ observations). Wallace and Freeman (1987) showed that the total message length of any general model $M$ with a vector of parameters $\alpha$ can be approximated as

$$I(M(\alpha), X) = \underbrace{\frac{|\alpha|}{2}\log(q_{|\alpha|}) - \log(h(\alpha)) + \frac{1}{2}\log(\det(F(\alpha)))}_{\text{first part}} + \underbrace{L(\alpha) - |X|\log(\epsilon) + \frac{|\alpha|}{2}}_{\text{second part}} \qquad (3)$$

where $h(\alpha)$ is the prior probability density of the parameters $\alpha$, $\det(F(\alpha))$ is the determinant of the 'expected' Fisher information matrix, $L(\alpha)$ is the negative log-likelihood function of $X$ given $\alpha$, $q_{|\alpha|}$ represents the Conway-Sloane (Conway and Sloane 1984) lattice quantization constant in $|\alpha|$-dimensional space, and $\epsilon$ is the uncertainty of each datum in the set $X$ of size $|X|$. Refer to Wallace (2005) and Allison (2018) for details of this method of estimation. This Wallace and Freeman (1987) method informs the computation of various message length terms in the work presented here.

Message length of a mixture model
Applying the general MML framework to the mixture models introduced in Section 2.1 allows us to characterize the length of the message needed to explain jointly any observed set of dihedral angle vectors $X$ using a mixture model $M$ with parameter vector $\Lambda$, analogously to Equation (2), as

$$I(M(\Lambda), X) = I(M(\Lambda)) + I(X \mid M(\Lambda)). \qquad (4)$$

This in turn is used to define the objective function we use to estimate an optimal set of mixture model parameters that can losslessly explain itself ($M(\Lambda)$) and the observations $X$ in the most succinct way in terms of Shannon information: $\Lambda_{\text{MML}} = \arg\min_{\forall \Lambda} I(M(\Lambda), X)$.

Computing the I(M(Λ)) term of Equation (4)
As described in Section 2.1, $\Lambda$ denotes the combined set of mixture model parameters $(|M|, \{w_j\}_{\forall 1 \le j \le |M|}, \{\Theta_j\}_{\forall 1 \le j \le |M|})$. Thus, the Shannon information content in a mixture model can be expressed as the summation of the message length terms required to state all its parameters losslessly:

$$I(M(\Lambda)) = \underbrace{I(|M|)}_{\text{Term 1}} + \underbrace{I(\{w_j\}_{\forall 1 \le j \le |M|})}_{\text{Term 2}} + \underbrace{\sum_{j=1}^{|M|} I(\Theta_j)}_{\text{Term 3}}. \qquad (5)$$

Computation of each of the message length terms on the right-hand side of Equation (5) is described below.
Computation of Term 1 of Equation (5)
$|M| \in \mathbb{Z}^+$ is a countable positive integer and thus can be stated using a universal prior for integers over a variable-length integer code (Allison et al. 2019). We employ the Wallace Tree Code (Wallace and Patrick 1993; Allison et al. 2019) to compute $I(|M|)$ in Equation (5).
Computation of Term 2 of Equation (5)
The $L_1$-normalized weight vector $\{w_j\}_{\forall 1 \le j \le |M|}$ can be viewed as the parameter of a multinomial distribution, whose support defines a unit $(|M| - 1)$-simplex (Wallace 2005; Allison 2018). Using the Wallace-Freeman method of estimation described in Section 2.2.2, assuming a uniform prior for the weights as a point in a unit $(|M| - 1)$-simplex, i.e. the prior $h = (|M| - 1)!/\sqrt{|M|}$, and computing the determinant of the Fisher information matrix for a multinomial distribution (with parameters $\{w_j\}$) as $N^{|M|-1}/\prod_{j=1}^{|M|} w_j$, it can be shown [as per the first part of Equation (3), with $|M| - 1$ free parameters] that the message length of Term 2 is given by (Allison 2018):

$$I(\{w_j\}_{\forall 1 \le j \le |M|}) = \frac{|M| - 1}{2}\log(q_{|M|-1}) - \log\left(\frac{(|M| - 1)!}{\sqrt{|M|}}\right) + \frac{1}{2}\log\left(\frac{N^{|M|-1}}{\prod_{j=1}^{|M|} w_j}\right).$$

Computation of Term 3 of Equation (5)
Recall (from Section 2.1) that each $\Theta_j = \{\langle \mu_{jp}, \kappa_{jp} \rangle\}_{\forall 1 \le p \le d}$.
Thus, $I(\Theta_j) = \sum_{p=1}^{d} I(\langle \mu_{jp}, \kappa_{jp} \rangle)$. Each $I(\langle \mu_{jp}, \kappa_{jp} \rangle)$ term in the summation is estimated by again applying the Wallace-Freeman method (Section 2.2.2), this time for a von Mises circular distribution. A von Mises distribution defines a probability distribution of a random variable $x$ on a circle (i.e. $x \in (-\pi, \pi]$) as a function of its two free parameters, mean $\mu \in (-\pi, \pi]$ and concentration $\kappa > 0$:

$$f(x; \langle \mu, \kappa \rangle) = \frac{\exp(\kappa \cos(x - \mu))}{2\pi B_0(\kappa)},$$

where the denominator on the right-hand side gives the normalization constant of the distribution in terms of the modified Bessel function (of order 0), denoted here as $B_0(\kappa)$. More commonly, modified Bessel functions of order $r$ are denoted as $I_r(\cdot)$. We use $B_r$ here only to avoid confusion with the Shannon information content notation, $I(\cdot)$.
In applying the Wallace-Freeman method, the assumed priors for the two parameters are [as per Kasarapu and Allison (2015)]: $h(\mu) = \frac{1}{2\pi}$ and $h(\kappa) = \frac{\kappa}{(1 + \kappa^2)^{3/2}}$. Thus, $h(\langle \mu, \kappa \rangle) = h(\mu)h(\kappa)$. We note that the rationale and behaviour of these priors for von Mises distributions have been previously studied (Wallace 2005). The chosen prior on $\mu$ is uniform (and hence uninformative/flat), giving only general information about the variable being estimated, which makes it suitable. On the other hand, no truly uninformative prior exists for $\kappa$. The chosen prior ensures the function is smooth (without singularities) and is commonly preferred when the data concentration is expected to arise from physical interactions (Wallace 2005). Further, for some $N$ observations of circular angles in the range $(-\pi, \pi]$ defined by (say) the set $X = \{x_1, x_2, \ldots, x_N\}$, it can be shown that the 'determinant' of the expected Fisher information matrix for a von Mises distribution can be characterized as $\det(F(\langle \mu, \kappa \rangle)) = \kappa N A(\kappa) A'(\kappa)$, where $A(\kappa) = \frac{B_1(\kappa)}{B_0(\kappa)}$ and $A'(\kappa) = \frac{d}{d\kappa} A(\kappa)$. Using this prior and determinant, the message length term to state the pair of $\langle \mu, \kappa \rangle$ parameters of any single von Mises circular distribution [as per the first part of Equation (3)] can be written as

$$I(\langle \mu, \kappa \rangle) = \log(q_2) - \log(h(\langle \mu, \kappa \rangle)) + \frac{1}{2}\log(\det(F(\langle \mu, \kappa \rangle))).$$

2.3.2 Computing the I(X|M(Λ)) term of Equation (4)
The second part of Equation (4) deals with explaining the observations of the vectors of dihedral angles $X$ using the mixture model parameters that have been stated losslessly via the first part (Section 2.3.1).
Using the relationship between Shannon information and probability (Section 2.2.1), that is, $I(\cdot) = -\log(\Pr(\cdot))$, $I(X \mid M(\Lambda))$ can be decomposed using the likelihood of each dihedral angle $x_{ip}$ of each $d$-dimensional vector $x_i \in X$ (assuming independent and identically distributed data) using the mixture model parameters as

$$I(X \mid M(\Lambda)) = -\sum_{i=1}^{N} \log\left(\sum_{j=1}^{|M|} w_j \prod_{p=1}^{d} \epsilon \, f(x_{ip}; \langle \mu_{jp}, \kappa_{jp} \rangle)\right),$$

where $\epsilon$ in the above expression denotes the degree of uncertainty of each dihedral angle $x_{ip}$ used to estimate its component likelihood over a von Mises distribution. This work sets $\epsilon = 0.0873$ radians (approximately 5°), based on the observation that the effective precision of 3D atomic coordinates is not better than 0.1 Å (Konagurthu et al. 2014).
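A minimal sketch of how the second part of the message could be evaluated is given below. It assumes that the uncertainty $\epsilon$ multiplies each von Mises density term, that lengths are reported in bits, and that $\kappa$ is moderate (below roughly 500) so that a plain power series for $B_0(\kappa)$ does not overflow. All function names are hypothetical.

```python
import math

EPSILON = 0.0873  # radians; the per-angle uncertainty quoted in the text

def log_bessel_i0(kappa):
    """log B0(kappa) from the series sum_m ((kappa/2)^(2m)) / (m!)^2.

    Assumes a moderate kappa (roughly below 500) to avoid float overflow.
    """
    term, total, m = 1.0, 1.0, 0
    while term > 1e-16 * total:
        m += 1
        term *= (kappa / 2.0) ** 2 / (m * m)
        total += term
    return math.log(total)

def log_von_mises(x, mu, kappa):
    """Natural log of the von Mises density at angle x (radians)."""
    return kappa * math.cos(x - mu) - math.log(2.0 * math.pi) - log_bessel_i0(kappa)

def second_part_bits(data, weights, components, eps=EPSILON):
    """Message length (bits) of the data given the mixture parameters."""
    total_nats = 0.0
    for x in data:
        # log of w_j * prod_p (eps * f(x_p; mu_jp, kappa_jp)) for each component j
        logs = [
            math.log(w) + sum(math.log(eps) + log_von_mises(xp, mu, kappa)
                              for xp, (mu, kappa) in zip(x, theta))
            for w, theta in zip(weights, components)
        ]
        top = max(logs)  # log-sum-exp for numerical stability
        total_nats -= top + math.log(sum(math.exp(v - top) for v in logs))
    return total_nats / math.log(2.0)
```

Working in log space with the log-sum-exp trick keeps the computation stable when individual component likelihoods are very small, which is typical for concentrated components in several dimensions.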

2.4 Search for optimal mixture model parameters
2.4.1 Expectation-maximization (EM)
To search for an optimal mixture model $M(\Lambda_{\text{MML}})$ that minimizes Equation (4), we employ a deterministic EM algorithm commonly employed for statistical parameter estimation problems (Dempster et al. 1977; McLachlan and Basford 1988; McLachlan et al. 2019). EM is an iterative algorithm which, in each iteration, explores local updates to the current parameter estimates in order to generate new parameter estimates that yield progressively shorter message lengths [in this work, the evaluation of Equation (4)] until convergence. Let $\Lambda(t)$ denote the state of the mixture parameters at an iteration indexed by $t \ge 0$. Then at each iteration indexed as $\{1, 2, \ldots, t, t+1, \ldots\}$ the EM performs an E(xpectation)-step followed by an M(aximization)-step, as described below.

E-step
Using the current state of parameter estimates after iteration $t$, i.e. $\Lambda(t)$, the E-step calculates the (probabilistic) 'responsibilities' $r_{ij}(t+1)$, $\forall 1 \le i \le N, 1 \le j \le |M|$, in the next iteration $t+1$ as

$$r_{ij}(t+1) = \frac{w_j(t) \, f(x_i; \Theta_j(t))}{\sum_{k=1}^{|M|} w_k(t) \, f(x_i; \Theta_k(t))}.$$

Formally, the responsibility $r_{ij}$ is the posterior probability that $x_i$ belongs to component $j$, and it quantifies the degree to which component $j$ 'explains' the data point $x_i$ (McLachlan et al. 2019). From these responsibilities, given $N$ observations of dihedral angles, any $j$-th component's membership in iteration $t+1$ is calculated as

$$n_j(t+1) = \sum_{i=1}^{N} r_{ij}(t+1) \quad \text{and} \quad \sum_{j=1}^{|M|} n_j(t+1) = N.$$
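The E-step can be sketched as below. To keep the example independent of any particular density implementation, it takes precomputed log component densities as input; the function name `e_step` and this interface are assumptions made for illustration.

```python
import math

def e_step(log_weights, log_densities):
    """Compute responsibilities r_ij and memberships n_j.

    log_weights[j]      : log w_j for component j
    log_densities[i][j] : log f(x_i; Theta_j) for observation i, component j
    """
    responsibilities = []
    for row in log_densities:
        joint = [lw + lf for lw, lf in zip(log_weights, row)]
        top = max(joint)  # log-sum-exp for numerical stability
        log_norm = top + math.log(sum(math.exp(v - top) for v in joint))
        responsibilities.append([math.exp(v - log_norm) for v in joint])
    # n_j is the sum over observations of r_ij; the n_j add up to N.
    memberships = [sum(r[j] for r in responsibilities)
                   for j in range(len(log_weights))]
    return responsibilities, memberships
```

Each row of responsibilities sums to 1, so the memberships necessarily sum to the number of observations, matching the constraint stated above.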

M-step
In the M-step, the mixture parameters are updated as follows.
The set of weights for $t+1$ is derived as the MML estimates of the parameters of a multistate distribution (Allison 2018) with $N$ observations over $|M|$ distinct states, while treating $\{n_j(t+1)\}_{\forall 1 \le j \le |M|}$ as each component/state's number of observed instances (out of $N$):

$$w_j(t+1) = \frac{n_j(t+1) + \frac{1}{2}}{N + \frac{|M|}{2}}.$$

Further, the update to each mean parameter of a von Mises distribution ($\forall 1 \le j \le |M|, 1 \le p \le d$) is given by

$$(\cos \mu_{jp}(t+1), \sin \mu_{jp}(t+1)) = \frac{R_{jp}}{\|R_{jp}\|}, \qquad R_{jp} = \sum_{i=1}^{N} r_{ij}(t+1) \, (\cos x_{ip}, \sin x_{ip}),$$

where $R_{jp}$ is the 'vector sum' over the $p$-th dihedral angle $x_{ip}$ of each tuple $x_i \in X$, weighted by its corresponding responsibility $r_{ij}(t+1)$. We note that this vector sum arises because each dihedral angle is written as a 2D trigonometric coordinate $(\cos x_{ip}, \sin x_{ip})$ on a unit circle. $\|R_{jp}\|$ is the vector norm of the resultant vector $R_{jp}$.
Finally, the update to the concentration parameter $\kappa_{jp}$ of each von Mises distribution ($\forall 1 \le j \le |M|, 1 \le p \le d$) follows a numerical approach, as solving for the roots of $\frac{\partial}{\partial \kappa} I(\langle \mu, \kappa \rangle, X_p) = 0$ has no closed form (see Supplementary Section S1).
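The weight and mean updates can be sketched as follows. The weight formula used here, $(n_j + 1/2)/(N + |M|/2)$, is the standard MML estimate for a multistate distribution and is stated as an assumption about the exact form intended; the function names are hypothetical.

```python
import math

def update_weights(memberships):
    """MML multistate estimate of the weights from memberships n_j.

    Assumes the standard form (n_j + 1/2) / (N + |M|/2).
    """
    n_total = sum(memberships)
    num_components = len(memberships)
    return [(n_j + 0.5) / (n_total + num_components / 2.0)
            for n_j in memberships]

def update_mean(angles, responsibilities):
    """Responsibility-weighted circular mean of one dihedral angle.

    angles[i]           : x_ip for a fixed angle index p (radians)
    responsibilities[i] : r_ij for a fixed component j
    Returns (mean angle in radians, norm of the resultant vector).
    """
    rx = sum(r * math.cos(x) for x, r in zip(angles, responsibilities))
    ry = sum(r * math.sin(x) for x, r in zip(angles, responsibilities))
    return math.atan2(ry, rx), math.hypot(rx, ry)
```

The vector sum matters because angles wrap around: two observations at 3.0 and −3.0 radians have an arithmetic mean of 0, whereas their circular mean lies at ±π, which is where the data actually concentrate.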
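The MML update for the concentration is detailed in Supplementary Section S1 and is not reproduced here. As a stand-in that illustrates the same kind of numerical root finding, the sketch below solves the simpler maximum-likelihood condition $A(\kappa) = \bar{R}$ by bisection, where $\bar{R}$ is the mean resultant length and $A(\kappa) = B_1(\kappa)/B_0(\kappa)$. This is explicitly not the MML estimator; the function names and the search bounds are assumptions.

```python
def bessel_ratio(kappa):
    """A(kappa) = B1(kappa) / B0(kappa), from the power series of both.

    Assumes a moderate kappa (roughly below 500) to avoid float overflow.
    """
    half = kappa / 2.0
    term0 = 1.0    # series term of B0: half^(2m) / (m!)^2
    term1 = half   # series term of B1: half^(2m+1) / (m! (m+1)!)
    b0, b1 = term0, term1
    for m in range(1, 2000):
        term0 *= half * half / (m * m)
        term1 *= half * half / (m * (m + 1))
        b0 += term0
        b1 += term1
        if term0 < 1e-17 * b0:
            break
    return b1 / b0

def solve_kappa(mean_resultant_length, lo=1e-6, hi=500.0, tol=1e-10):
    """Bisection for A(kappa) = R-bar; A is increasing in kappa."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bessel_ratio(mid) < mean_resultant_length:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Bisection is used here because $A(\kappa)$ is monotonically increasing, so a bracketing search cannot diverge; a Newton iteration would converge faster but needs the derivative $A'(\kappa)$.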

Search for the optimal number of mixture components, |M|
A priori, the number of mixture components $|M|$ is unknown, along with the other mixture parameters. Thus, the EM algorithm starts with a single-component mixture model at iteration $t = 0$ (i.e. $|M| = 1$). It then follows similar mechanics to that described by Kasarapu and Allison (2015), albeit with some improvements.
Starting from a single-component mixture at $t = 0$, during each iteration $(t+1)$ a set of perturbations (Split, Merge, and Delete) is systematically executed on each component of the mixture model $\Lambda(t)$. We note that each Split of a component increases the number of components $|M|$ by 1, whereas Merge and Delete each decrease it by 1. After each such perturbation, the parameters of the resulting new mixture (with an increased/decreased number of components) are re-estimated using the EM updates described in Section 2.4.1, starting with initial parameters assigned deterministically at the E-step. After systematically exploring all of the above perturbations on each component, the perturbation that yields the best improvement to the message length [as per Equation (4)] is chosen going into the next iteration, and so on, until convergence.
The rationale for each of the Split, Merge, and Delete operations, together with the full details of their mechanics, is provided in Supplementary Section S2. Furthermore, Supplementary Section S11 demonstrates the stability and convergence of this search process.

Datasets and benchmarks
3.1.1 Curating the dihedral angle datasets
Atomic coordinates of 38,895 protein structures with non-redundant amino acid sequences (≤50% sequence identity) were derived from the Protein Data Bank (Berman et al. 2000), considering only structures with an R-factor cut-off at 0.3 and resolution cut-off at 3.5 Å or better. We call this collection PDB50. Further, as a way to test the effect that the precision of input data has on the inferred models, we also consider another (≤50% sequence identity) dataset containing 9568 high-resolution (≤1.8 Å) X-ray structures with a B-factor cut-off of 40 and R-factor cut-off of 0.22. We call this collection PDB50HighRes.
For a complete atomic coordinate record of each amino acid observed in any considered structure, we calculate a vector of backbone and sidechain dihedral angles: $\{\phi, \psi, \omega, \chi_1, \chi_2, \ldots\}$. (We note that the partial double-bond character of the peptide bond makes $\omega$ typically ~180° and rarely ~0°. Thus, for our inference, $\omega$ dihedrals were excluded from the input set.) Overall, this resulted in 22,177,093 observations (vectors of dihedral angles) from PDB50 and 3,774,207 observations from PDB50HighRes, considering only the atomic coordinates of the 20 natural amino acids within proteins. We then partitioned these observations into 20 sets of amino acid specific dihedral angle vectors ($X^{(aa)}$), one for each distinct amino acid (aa). Table 1 gives the breakdown of the number of observations per amino acid type, along with their corresponding number of (backbone + sidechain) dihedral angles. For each of these amino acid specific input sets $X^{(aa)}$, its corresponding mixture model $M(\Lambda^{(aa)})$ (one for the PDB50 dataset and another for the PDB50HighRes dataset) was inferred and its parameters estimated automatically using the MML methodology (described in Section 2).

Dunbrack backbone-dependent rotamer libraries
We benchmark the performance and fidelity of our inferred mixture models against the latest version of the Dunbrack 'backbone-dependent' rotamer (sidechain conformation) libraries (Shapovalov and Dunbrack 2011), across the varying degrees of smoothing [2%, 5% (default), 10%, and 20%] that those libraries provide. The Dunbrack libraries define the state of the art for modelling and sampling sidechain conformations, 'conditioned' on any stated backbone dihedral angles $\langle \phi, \psi \rangle$. Specifically, the Dunbrack rotamer library discretizes each amino acid's backbone dihedral angles $\langle \phi, \psi \rangle$ into $36^2 = 1296$ bins (of $10^\circ \times 10^\circ$ granularity). For each $\langle \phi, \psi \rangle$ bin, there are commonly $3^m$ models. Here, 3 arises from the three-way discretization of each sidechain dihedral angle into {gauche+ (g+), trans (t), gauche− (g−)} states, whereas $m$ denotes the number of 'sidechain' dihedral angles $\langle \chi_1, \chi_2, \ldots \rangle$ in that amino acid. For example, the amino acid methionine has $m = 3$ and the Dunbrack rotamer library lists $36 \times 36 \times 3^3 = 34{,}992$ models across its 1296 possible $\langle \phi, \psi \rangle$ bins. The Dunbrack rotamer library divides the set of amino acid types into 'rotameric' and 'non-rotameric' categories. The closed-form count of $3^m$ models holds for all 'rotameric' amino acids, whereas the 'non-rotameric' amino acids (glutamic acid, glutamine, aspartic acid, asparagine, tryptophan, histidine, tyrosine, and phenylalanine) have more components, as some of their sidechain dihedrals do not conform to three-way discretizations.
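The closed-form count for rotameric amino acids is simple arithmetic, sketched below. The function name is hypothetical, and the count applies only to the 'rotameric' amino acids described above.

```python
def dunbrack_rotameric_model_count(num_sidechain_dihedrals, bin_width_degrees=10):
    """Number of per-bin rotamer models for a rotameric amino acid.

    Backbone (phi, psi) space is cut into (360 / bin_width)^2 bins, and
    each bin holds 3^m models for m sidechain dihedral angles.
    """
    bins_per_axis = 360 // bin_width_degrees
    backbone_bins = bins_per_axis ** 2
    rotamers_per_bin = 3 ** num_sidechain_dihedrals
    return backbone_bins * rotamers_per_bin
```

With $m = 3$ (as for methionine) this reproduces the 34,992 models quoted above, and the count grows threefold with every additional sidechain dihedral angle.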

Information-theoretic complexity versus fidelity/ fit of the inferred models
In almost all model selection problems, one seeks answers to two key questions: (i) What is the fidelity of the model in its ability to explain observed data? (ii) How complex is the selected model? The second question is necessary because, when a simpler model (in complexity terms) can explain/fit the same data equivalently or better than a more complex model, the simpler model is preferred, not only due to Ockham's razor but also as made rigorous by Bayes' theorem (Allison 2018). The information-theoretic framework of MML provides a direct way to quantify model complexity and fit in terms of bits. For any proposed model, the total two-part message length combines (i) the lossless encoding of the model, the length (bits) of which yields the model's (descriptive) complexity, and (ii) the lossless encoding of the observed data given that model, the length (bits) of which yields its fidelity by quantifying how well the model fits the data (see Section 2.2). Table 2 gives the complexity and fidelity statistics of our inferred models and compares them directly with the state-of-the-art Dunbrack rotamer library at the 5% ('default') smoothing level (see Supplementary Section S12 for results on other smoothing levels). Before we discuss these quantitative results, let us explore how and why they can be evaluated fairly, and on an equal footing.
For each of the 1296 bins in the Dunbrack library, the information in the library can be directly translated into a bin-wise mixture model with a fixed number of mixture components, where each component contains a product of $m$ von Mises circular distributions, and $m$ is the number of sidechain dihedral angles for the specified amino acid (aa). [We note that the amino acids alanine (ALA) and glycine (GLY) have no sidechain dihedral angles, so the Dunbrack library does not have any models for ALA and GLY.] However, as mentioned above, the number of components of each of those 1296 mixture models related to an amino acid is static/fixed and corresponds to the number of discrete states over $m$ sidechain angles (often three-way for each sidechain dihedral angle $\chi$, as discussed earlier). Thus, the number of mixture components for each $\langle \phi, \psi \rangle$ bin is usually $3^m$, which yields a large number of models across all bins (e.g. 34,992 for methionine, as shown in Table 2). This number matters, as it is proportional to the number of von Mises parameters (and respective mixture weights) that informs the complexity of the statistical model being proposed. In contrast, the MML approach infers only one mixture model for any amino acid, jointly over all (backbone + sidechain) dihedral angles, with all of its mixture parameters estimated unsupervised, including the number of mixture components $|M^{(aa)}|$.
Comparing the model fit/fidelity is more involved: while our work models the joint distributions over all (backbone + sidechain) dihedral angles, Dunbrack's only deals with sidechain dihedrals conditioned on discretized states of the backbone. With this difference in the models, there are two possible directions to take to ensure the comparison of fidelity between the two is on the same footing. For any set of observations of all dihedral angles for a specified amino acid $X^{(aa)}$: 1) The $\phi$ and $\psi$ angles under the Dunbrack model are stated over a uniform distribution (for this is precisely their underlying model), so that the message length of stating each vector of dihedrals using both models can be objectively compared. We show these results for PDB50 in the main text (see Table 2). Results for PDB50HighRes are included in Supplementary Section S4. 2) From each MML-inferred mixture model, we drop/omit the von Mises circular terms corresponding to backbone dihedral angles when estimating the length, yielding the second part of the message for only the sidechain dihedral angles of the observations. These results are presented in Supplementary Sections S3 (for PDB50) and S5 (for PDB50HighRes).
The above two ways of comparing the fidelity of the two models yield a similar conclusion: the MML-inferred mixture models (across all amino acids) are not only significantly more concise, but also explain the observed data better than the Dunbrack rotamer library (across the levels of smoothing it provides). Supplementary Section S9 provides a detailed explanation of how the lossless message length terms for Dunbrack's model are calculated.
Table 1. PDB50 dataset statistics: amino acid type (aa), number of observations of that amino acid in PDB50 (N^(aa)), and the total number of (backbone + sidechain) dihedral angles in that amino acid (d^(aa)). The counts in d^(aa) ignore the ω dihedral angle.
Table 2. For each of the 20 naturally occurring amino acids (aa), N^(aa) gives the size of the input set (X^(aa)) on which the comparison is based. |M^(aa)| gives the number of components of the mixture model, and |K^(aa)| gives the number of parameters across all components of the mixture model, inferred unsupervised. |D^(aa)_rotamer| is the cumulative sum of all components described by the Dunbrack rotamer library, whereas #Params gives the corresponding total number of parameters implicit in that library. For both models, the complexity (first-part length in bits), fidelity (second-part length in bits), and their two-part total are shown. The number of bits per residue for each of the models is also shown (the respective total message length divided by N^(aa)). Finally, to measure the extent of lossless compression each model provides, the null model message length of stating the vector of dihedral angles encoded under a uniform distribution is shown as a baseline. Note that the 'N/A' entries for alanine (ALA) and glycine (GLY) arise because those amino acids do not have sidechain dihedral angles. While we model the joint distributions of dihedral angles including the backbone, the Dunbrack library only provides sidechain distributions conditional on the backbone. Hence, for ALA and GLY, the Dunbrack library estimates are necessarily empty.
Comparing the model complexity, Table 2 clearly shows that MML-inferred models are three orders of magnitude (in bits) more concise than those of the Dunbrack rotamer library. This is mainly due to the proliferation of the number of parameters in the Dunbrack model (see the eighth column of Table 2 under #nParams) compared with the lower number in the MML mixture model (third column under |K^(aa)|).
Further, comparing the model fidelity, all MML mixture models yield a better (lossless) explanation of the observed data than the corresponding Dunbrack models. The improvement varies across amino acids, with the largest improvement observed for proline (PRO), where the second-part message length from the MML mixture model is ~35% shorter than Dunbrack's. At the other end, for arginine (ARG) the improvement is ~11%. The median improvement is ~18%, for glutamine (GLN). The mean improvement is 20.1% on PDB50 and 19.3% on PDB50HighRes (Supplementary Table S2). Thus, from the results, it can be unambiguously concluded that the MML mixture models from this work outperform the state of the art in an objective quantitative comparison. Supplementary Sections S3 and S5 provide the alternative comparison between complexity and fit of the two models, involving the lossless comparison of sidechain dihedral angles and ignoring the backbone, for PDB50 and PDB50HighRes.
Finally, we also assess how similar or different the inferred MML mixture models are for individual amino acids across the two datasets we have considered: PDB50 and PDB50HighRes. We use the Kullback-Leibler (KL) divergence (relative entropy), which provides a direct way to compare two probability distributions. Supplementary Table S4 provides the KL divergence values. The small KL divergence across all amino acids indicates the similarity of the two inferred distributions. More generally, it has been demonstrated that the MML estimator is statistically robust and detects signal reliably even when the precision of the input data varies (Wallace 2005).
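The KL divergence between two mixtures of von Mises products has no simple closed form, so one common option is a Monte Carlo estimate. The sketch below illustrates that idea only; the source does not state how the values in Supplementary Table S4 were computed, and the mixture layout, function names, and parameter values are hypothetical.

```python
import math
import random


def log_i0(kappa, max_terms=500):
    """log I0(kappa) from its power series (adequate for moderate kappa)."""
    total = term = 1.0
    for k in range(1, max_terms):
        term *= (kappa / 2.0) ** 2 / (k * k)
        total += term
        if term <= 1e-17 * total:
            break
    return math.log(total)


def log_mixture_density(x, mixture):
    """mixture: list of (weight, [(mu, kappa), ...]), one (mu, kappa) per angle."""
    logs = []
    for w, comps in mixture:
        lp = math.log(w)
        for t, (mu, kappa) in zip(x, comps):
            lp += kappa * math.cos(t - mu) - math.log(2 * math.pi) - log_i0(kappa)
        logs.append(lp)
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def sample(mixture, rng):
    """Pick a component by weight, then draw each angle independently."""
    weights = [w for w, _ in mixture]
    _, comps = rng.choices(mixture, weights=weights, k=1)[0]
    return [rng.vonmisesvariate(mu % (2 * math.pi), kappa) for mu, kappa in comps]


def kl_bits(p, q, n=5000, seed=0):
    """Monte Carlo estimate of KL(p || q) in bits, using n samples drawn from p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample(p, rng)
        total += log_mixture_density(x, p) - log_mixture_density(x, q)
    return total / n / math.log(2)


# Hypothetical two-angle mixtures, not PhiSiCal parameters.
p = [(0.6, [(1.0, 5.0), (3.0, 4.0)]), (0.4, [(4.0, 8.0), (0.5, 2.0)])]
q = [(0.5, [(1.1, 5.0), (3.0, 4.0)]), (0.5, [(4.0, 6.0), (0.7, 2.0)])]
print(kl_bits(p, q))
```

The estimate is exactly zero when both arguments are the same mixture and grows as the two distributions move apart; because KL divergence is asymmetric, `kl_bits(p, q)` and `kl_bits(q, p)` generally differ.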

Visualization of fidelity of the models
Here, we compare the fidelity of MML mixture models and Dunbrack rotamer library by randomly sampling 100,000 data points (vectors of dihedral angles) and contrasting the resultant distributions from the two models against the observed (empirical) distribution. The method of sampling from any MML-inferred mixture model and (for comparison) Dunbrack's library is described in Supplementary Section S10.
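Sampling from a mixture of von Mises products is typically done in two steps: choose a component according to the mixture weights, then draw each dihedral angle independently from that component's von Mises term. The sketch below shows this generic procedure with Python's standard library; it is not the procedure of Supplementary Section S10, and the mixture layout and parameter values are hypothetical.

```python
import math
import random


def sample_mixture(mixture, rng):
    """Draw one vector of dihedral angles (radians, in [0, 2*pi)).

    mixture: list of (weight, [(mu, kappa), ...]), one (mu, kappa) per angle.
    """
    weights = [w for w, _ in mixture]
    # Step 1: choose a component with probability proportional to its weight.
    _, comps = rng.choices(mixture, weights=weights, k=1)[0]
    # Step 2: draw each angle independently from its von Mises term.
    return [rng.vonmisesvariate(mu % (2 * math.pi), kappa) for mu, kappa in comps]


# Hypothetical three-angle mixture, not PhiSiCal parameters.
demo = [
    (0.5, [(math.radians(65), 20.0), (math.radians(180), 15.0), (math.radians(70), 10.0)]),
    (0.3, [(math.radians(185), 25.0), (math.radians(65), 12.0), (math.radians(175), 8.0)]),
    (0.2, [(math.radians(295), 30.0), (math.radians(180), 10.0), (math.radians(290), 6.0)]),
]

rng = random.Random(0)
points = [sample_mixture(demo, rng) for _ in range(1000)]
print(len(points), len(points[0]))  # 1000 3
```

Because each draw costs one weighted choice plus one von Mises variate per angle, the cost of sampling scales with the number of angles rather than with the number of components.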
To be able to assess similarities and differences visually, we examine two specific amino acids, methionine (MET) and glutamine (GLN). We choose this pair because (i) they both have three sidechain angles 〈χ1,χ2,χ3〉, thus allowing their joint visualization in 3D, and (ii) MET falls into the 'rotameric' class of amino acids, whereas GLN falls into the 'non-rotameric' class (Shapovalov and Dunbrack 2011), hence providing a representative from each of those two classes for inspection.
Below we show these qualitative comparisons for the models inferred on the PDB50 dataset. The corresponding ones for PDB50HighRes are included in Supplementary Section S6. Figure 2 clearly shows that, for both amino acids, the points sampled from the MML-inferred mixture model are significantly closer to the empirical distribution than the points sampled from the Dunbrack library, which are comparatively sparser. Although the points sampled from the Dunbrack library cover the main rotameric preferences, they fall short in modelling the details of the spread seen in the empirical distribution, which the MML mixture model explains well. This visualization is a qualitative demonstration of the clear quantitative difference we observed in their second-part message length terms (which quantify fidelity/fit in bits of information) shown earlier in Table 2: MET (19.1% difference) and GLN (18.3%). We already saw that the complexity (first) parts of these models differ by orders of magnitude (in bits), again in favour of the MML mixture model. This in itself demonstrates the power of inference made under the MML framework, and the natural trade-off between complexity and fit that the framework permits. It is also a demonstration of the effectiveness of the EM method employed to infer these mixtures.
Finally, to give an overall view of the qualitative differences across all amino acids, we project the MML mixture model onto each individual sidechain dihedral angle dimension and compare the resulting probability distribution against the empirical (observed) distribution of that angle. For each amino acid, we randomly sample data points (vectors of dihedral angles) from the mixture model and plot them against the corresponding empirical distribution. Figure 3 shows these plots across all amino acids, with the mixture model shown as a red curve and the empirical distribution shown in yellow. For comparison, we include the distribution of sidechain dihedral angles obtained by randomly sampling from the Dunbrack library, shown in the same figure (in blue). The plots show that our mixture models fit the empirical distribution better than the Dunbrack models. (The visualization for PDB50HighRes is provided in Supplementary Section S7, and leads to the same conclusions.)
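Because each mixture component is a product of independent von Mises terms, the projection onto a single angle can also be written analytically: the marginal is a one-dimensional von Mises mixture with the same weights. The sketch below evaluates that marginal in 1° bins as an alternative to the sampling used for Figure 3; the function names and the example parameters are hypothetical.

```python
import math


def log_i0(kappa, max_terms=500):
    """log I0(kappa) from its power series (adequate for moderate kappa)."""
    total = term = 1.0
    for k in range(1, max_terms):
        term *= (kappa / 2.0) ** 2 / (k * k)
        total += term
        if term <= 1e-17 * total:
            break
    return math.log(total)


def von_mises_pdf(theta, mu, kappa):
    """Density of a von Mises distribution on the circle."""
    return math.exp(kappa * math.cos(theta - mu) - log_i0(kappa)) / (2 * math.pi)


def marginal_density(theta, mixture, j):
    """Marginal density of the j-th angle: same weights, j-th term of each component."""
    return sum(w * von_mises_pdf(theta, comps[j][0], comps[j][1]) for w, comps in mixture)


def marginal_histogram(mixture, j, bins=360):
    """Approximate probability mass per bin (midpoint rule), 1 degree wide by default."""
    width = 2 * math.pi / bins
    return [marginal_density((b + 0.5) * width, mixture, j) * width for b in range(bins)]


# Hypothetical two-angle mixture, not PhiSiCal parameters.
demo = [(0.7, [(1.0, 8.0), (3.0, 4.0)]), (0.3, [(4.0, 2.0), (0.5, 6.0)])]
hist = marginal_histogram(demo, j=0)
print(round(sum(hist), 6))  # 1.0
```

The analytic marginal avoids sampling noise, whereas a histogram of sampled points can be compared bin by bin with the empirical counts.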

Conclusion
We have successfully modelled the joint distribution of mainchain and sidechain dihedral angles of amino acids using mixture models. By measuring the Shannon information content, we showed that our mixture models outperform the models implied by the Dunbrack rotamer libraries (across levels of smoothing), both in terms of their model complexity (by three orders of magnitude) and their fidelity (yielding on average 20% more lossless compression) when explaining the observed dihedral angle datasets with varying resolution and filtering thresholds. We also demonstrated the robustness of the MML method of estimation, and showed that the inferred mixture models are not prone to the pitfalls of under/over-fitting and other inconsistencies common to many statistical model selection exercises. The conciseness of our mixture models also provides a computationally cheap and reliable way to jointly sample 〈ϕ,ψ,χ1,χ2,…〉 dihedral angles (and also to sample conditionally given 〈ϕ,ψ〉), and the models are ready for use in downstream studies: experimental structure refinement, de novo protein design, and protein structure prediction, among others. Our mixture models, PhiSiCal (ϕψχal), are available for download from http://lcb.infotech.monash.edu.au/phisical. Also available from this link are programs to sample from the mixture models and report descriptive statistics (probability, log-odds ratios between pairs of models, null probability to estimate statistical significance, etc.) for use in modelling and simulation exercises. We foresee several applications of candidate samples of amino acid conformations generated from PhiSiCal models. These include computational support for modelling amino acid 3D coordinates into electron density maps, predicting sidechain conformations given backbone states of amino acids, assessing protein structures to detect conformational outliers, and driving perturbations in molecular dynamics simulations, among others. We aim to address these as future work.
Figure 2. (a) The projection, into the sidechain (χ1, χ2, χ3) space (unwrapped), of 100,000 randomly sampled points (vectors of dihedral angles) for the amino acid methionine (MET) from the MML mixture model (first row, center), of the same number of points from the Dunbrack model (first row, right), and of the observed (empirical) distribution of the same angles (first row, left). In the plots of the second row, the same data are visualized differently over three separate plots, with each of the three sidechain dihedral angles as the x-axis (unwrapped) and the y-axis showing the corresponding relative probabilities (in 1° intervals). (b) The plots in the third and fourth rows are similar to those in the first and second, respectively, but for the 'non-rotameric' amino acid glutamine (GLN).
Figure 3. Fidelity of the inferred MML mixture models: the projected distribution of individual sidechain dihedral angles across all amino acids, derived by randomly sampling N^(aa) data points (see Table 1) from the MML-derived mixture models and the Dunbrack (5% smoothed) library, and compared with the empirical distribution.